Screenshot taken from Udacity
Let's take as an example the task of detection: imagine you have a camera looking at the street ahead, and you want to detect where the pedestrians are in front of you so that you don't hit them. How would you use a classifier to do that?
Screenshot taken from Udacity
Here's another example, web search ranking: imagine you have a search query and you want to find all the web pages that are relevant to that query. How would you use a classifier to do that?
Screenshot taken from Udacity
Let's get started by training a logistic classifier.
Screenshot taken from Udacity
How are we going to use the scores to perform the classification? Each image that we have as an input can have one and only one possible label, so we're going to turn the scores into probabilities. We want the probability of the correct class to be very close to one and the probability of every other class to be close to zero.
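As a rough sketch of where these scores come from (the weights W, the bias b, and the input x below are made-up placeholders, not values from the course), a logistic classifier computes one score per class as a linear function of its input:

In [ ]:
import numpy as np

# Hypothetical example: 3 classes, 4-dimensional input (shapes chosen arbitrarily).
W = np.array([[ 0.2, -0.5,  0.1,  2.0],
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.3,  0.2, -0.3]])   # weights, one row per class
b = np.array([1.1, 3.2, -1.2])             # biases, one per class
x = np.array([0.5, 1.0, -0.2, 0.3])        # a single input example

scores = W @ x + b   # the raw scores ("logits"), one per class
print(scores)

These scores are what the softmax below turns into probabilities.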
Screenshot taken from Udacity
The way to turn scores into probabilities is to use the softmax function: $$S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}$$
What's important to know is that it can take any kind of scores and turn them into proper probabilities: the probabilities sum to one, and they will be large when the scores are large and small when the scores are comparatively smaller. In the context of logistic regression, the scores are often also called "logits".
Screenshot taken from Udacity
Here's an example solution: we take the exponential of each score and divide by the sum of the exponentials of the scores across all categories.
Here is the result; notice that the probabilities do sum to 1. Let's add a legend to make this a bit clearer: as you can see, the probability of class 1 increases with the score x. It starts near zero and ends close to one, while the probabilities of the other classes start fairly high and then drop towards 0.
In [9]:
"""Softmax."""
# https://en.wikipedia.org/wiki/Softmax_function
%matplotlib inline
scores = [3.0, 1.0, 0.2]
import numpy as np
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
return np.exp(x) / np.sum(np.exp(x), axis=0)
print(softmax(scores))
# Plot softmax curves
import matplotlib.pyplot as plt
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
plt.plot(x, softmax(scores).T, linewidth=2)
plt.show()
If you multiply the scores by 10, what happens?
In [13]:
# Use a numpy array so that `scores * 10` scales the values (a Python list would be repeated instead).
scores = np.array([3.0, 1.0, 0.2])
print(softmax(scores * 10))
If you multiply the scores by 10, the probabilities get either very close to 1.0 or very close to 0.0.
Screenshot taken from Udacity
If you divide the scores by 10, what happens?
Screenshot taken from Udacity
If you increase the size of your outputs, your classifier becomes very confident about its predictions, but if you reduce the size of your outputs, your classifier becomes very unsure. Keep this in mind for later: we want the classifier to not be too sure of itself in the beginning, and then over time it will gain confidence as it learns.
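You can check both directions with the softmax defined above (the scale factors are just the values from the questions; the printed approximations are what the formula gives for these particular scores):

In [ ]:
# Reuses numpy and the softmax function defined earlier in this notebook.
scores = np.array([3.0, 1.0, 0.2])
print(softmax(scores * 10))   # very peaked: close to [1, 0, 0]
print(softmax(scores / 10))   # nearly uniform: roughly [0.39, 0.32, 0.29]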
Screenshot taken from Udacity
Next we need a way to represent our labels mathematically. We just said: let's have the probability for the correct class be close to one and the probability for all the others be close to zero. We can write down exactly that:
Screenshot taken from Udacity
For a consistent one-hot encoding we need to place the ones such that each class gets a unique position in the vector. Here's what I picked; given this encoding, the most likely class is C.
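As a small illustration (the three class names and their positions below are just an assumption for this sketch), a one-hot label is a vector with a 1 at the class's position and 0 everywhere else:

In [ ]:
import numpy as np

classes = ['a', 'b', 'c']          # hypothetical class names
num_classes = len(classes)

def one_hot(class_index, num_classes):
    """Return a vector with a 1 at class_index and 0 everywhere else."""
    label = np.zeros(num_classes)
    label[class_index] = 1.0
    return label

print(one_hot(2, num_classes))     # label for the last class: [0. 0. 1.]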
Screenshot taken from Udacity
One-hot encoding works very well for most problems until you get into situations where you have tens of thousands, or even millions, of classes. In that case your vector becomes really, really large and is mostly zeros everywhere, and that becomes very inefficient.
You'll see later how we can deal with this problem by using embeddings.
Screenshot taken from Udacity
What's nice about this approach is that we can now measure how well we're doing by simply comparing two vectors: the one that comes out of your classifier and contains the probabilities of your classes, and the one-hot encoding vector that corresponds to your labels.
Screenshot taken from Udacity
Let's see how we can do this in practice.
The natural way to measure the distance between those two probability vectors is called the cross entropy, denoted by D here for distance: $$D(S,L) = - \sum_i L_i \log(S_i)$$
Be careful! The cross entropy is not symmetric, and you have a nasty log in there, so you have to make sure that your labels and your distributions are in the right place.
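Here is a minimal sketch of that formula (the probability vector S and the one-hot label L are made-up numbers; note that the label stays outside the log, so its zeros never end up inside it):

In [ ]:
import numpy as np

def cross_entropy(S, L):
    """D(S, L) = -sum_i L_i * log(S_i), with S the predicted probabilities and L the one-hot label."""
    return -np.sum(L * np.log(S))

S = np.array([0.7, 0.2, 0.1])   # hypothetical classifier output (sums to 1)
L = np.array([1.0, 0.0, 0.0])   # one-hot label for the correct class

print(cross_entropy(S, L))      # -log(0.7) ≈ 0.357
# Swapping the arguments would put the zeros of L inside the log, which is why the order matters.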
Screenshot taken from Udacity
Let's recap, because we have a lot of pieces.
Screenshot taken from Udacity
This entire setting is often called multinomial logistic classification.
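To tie the recap together, here is a compact end-to-end sketch under made-up weights and a made-up input (W, b, x and the label below are placeholders): a linear model produces logits, softmax turns them into probabilities, and cross entropy compares those probabilities to the one-hot label.

In [ ]:
import numpy as np

def softmax(x):
    """Turn scores into probabilities that sum to 1."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def cross_entropy(S, L):
    """D(S, L) = -sum_i L_i * log(S_i)."""
    return -np.sum(L * np.log(S))

# Hypothetical model and input: 3 classes, 4-dimensional input.
W = 0.1 * np.random.randn(3, 4)        # small random weights
b = np.zeros(3)                        # zero biases
x = np.array([0.5, 1.0, -0.2, 0.3])    # a single input example
L = np.array([0.0, 1.0, 0.0])          # one-hot label: the correct class is the second one

logits = W @ x + b          # linear scores
S = softmax(logits)         # probabilities
loss = cross_entropy(S, L)  # distance between prediction and label
print(S, loss)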
Screenshot taken from Udacity